


Parameter Tuning

Neural Information Processing Systems

If observations from the joint distribution of (A,Y,Z,W) are available in both stages, we can tune the regularization parameters λ1, λ2 using the approach proposed in Singh et al. [30] and Xu et al. [35]. Let the complete data of stage 1 and stage 2 be denoted as (ai, yi, zi, wi) and (ãi, ỹi, z̃i, w̃i), respectively. Then, we can use the data not used in each stage to evaluate the out-of-sample performance of the other stage.

A(2), V̂(T), u(T) are the parameters learned by Algorithm 1.

In this appendix, we prove the propositions given in the main text. In the following, we assume that the spaces U, A, Z, W are separable and completely metrizable topological spaces equipped with Borel σ-algebras. In this section, we use the notation PA|Z=z to express the distribution of a random variable A given another variable Z = z.
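The cross-stage tuning of λ1 and λ2 described under Parameter Tuning can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scalar variables, replaces the learned feature maps with raw linear features plus an intercept, and uses closed-form ridge regression in both stages. The names `ridge_fit` and `tune_two_stage` and the candidate grids are hypothetical. The point is only the data flow: λ1 is scored on the stage-2 sample, and λ2 is scored on the stage-1 sample, which is why both samples need all of (A, Y, Z, W).

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge regression: argmin_b ||y - X b||^2 + lam * n * ||b||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)


def tune_two_stage(stage1, stage2, lams1, lams2):
    """Pick (lam1, lam2) by scoring each stage on the other stage's data.

    stage1 and stage2 are tuples (a, y, z, w) of 1-D arrays (an assumption of
    this sketch; linear features stand in for learned feature maps).
    """
    a1, y1, z1, w1 = stage1
    a2, y2, z2, w2 = stage2

    # Stage-1 design matrices built from (A, Z), with an intercept column.
    X1 = np.column_stack([a1, z1, np.ones(len(a1))])
    X2 = np.column_stack([a2, z2, np.ones(len(a2))])

    # lam1: fit W ~ (A, Z) on stage-1 data, evaluate on stage-2 data.
    def loss1(lam):
        b = ridge_fit(X1, w1, lam)
        return np.mean((w2 - X2 @ b) ** 2)

    lam1 = min(lams1, key=loss1)
    b1 = ridge_fit(X1, w1, lam1)

    # Stage-2 features: treatment, predicted W given (A, Z), intercept.
    def feats2(a, X):
        return np.column_stack([a, X @ b1, np.ones(len(a))])

    # lam2: fit Y on stage-2 data, evaluate on stage-1 data.
    def loss2(lam):
        b = ridge_fit(feats2(a2, X2), y2, lam)
        return np.mean((y1 - feats2(a1, X1) @ b) ** 2)

    lam2 = min(lams2, key=loss2)
    return lam1, lam2
```

Calling `tune_two_stage(stage1, stage2, [1e-3, 1e-1, 10.0], [1e-3, 1e-1, 10.0])` returns one value from each grid. If only (A, Z, W) were observed in stage 1, the stage-2 loss could not be evaluated on that sample, which matches the requirement stated above.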



Assumption 4. For each a ∈ A, the operator Ea is compact with singular system […]. The operator Ea (denoted by Kx in [20]) is defined by the relevant densities accordingly (see the paragraph after Lemma 2 of [20]). It is easy to see that Assumptions 4 and 5 are required for using Proposition 5.

Remark 2. The first condition in Assumption 2 differs from Condition 3 in [20] in how it establishes that the conditional expectation E[Y|A=a,Z=·] belongs to N(Fa)⊥. More specifically, Condition 3 in [20] is equivalent to having N(Fa)={0}, and so any nontrivial L2(PZ|A=a)-function is in the orthogonal complement.

Lemma 2. Under Assumptions 1, 2, 4 and 5, for each a ∈ A, there exists a function ha […]. By Lemma 1, the regression function E[Y|A=a,Z=·] is in N(Ea).

For simplicity, we set all regularization terms to zero.